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Abstract. The possibility of deriving the contact potentials between amino acids 
from their frequencies of occurence in proteins is discussed in evolutionary terms. This 
approach allows the use of traditional thermodynamics to describe such frequencies 

1-^ . and, consequently, to develop a strategy to include in the calculations correlations 

^^' due to the spatial proximity of the amino acids and to their overall tendency of being 

conserved in proteins. Making use of a lattice model to describe protein chains and 
defining a "true" potential, we test these strategies by selecting a database of folding 
^ ' model sequences, deriving the contact potentials from such sequences and comparing 

^^ . them with the "true" potential. Taking into account correlations allows for a markedly 

(^ [ better prediction of the interaction potentials. 
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Deriving amino acid contact potentials 2 

1. Introduction: the "quasi-chemical approximation" revisited 

The idea of obtaining the interaction energy between pairs of amino acids, in the form of 
a contact potential, from the statistical analysis of known proteins has been developed 
in detailed by Miyazawa and Jernigan in 1985 p]. Assuming that the contact formation 
can be regarded as a chemical reaction, they make use of Boltzmann statistics to relate 
the frequencies of occurence of a given pair of amino acids in proteins to the associated 
contact energy (this is the so-called "quasi-chemical approximation):!:. This idea has 
been further developed in refs. [1^1 El H^lIT^ . 

Since then, this method has undergone wide criticism, both from the point of view 
of its soundness and of its implementation. Finkelstein and coworkers challenged the 
validity of the basic assumption of the method [2]. They highlighted the fact that 
Boltzmann statistics arises as a consequence of transitions between states of a physical 
system, while pairs of amino acids are not free to move in the data set of proteins used 
to obtain the distribution of contacts between pairs of amino acids. In other words, 
different contacts are not excitations of the same system. Nevertheless, they suggest 
that contact probabilities still follow a Boltzmann distribution, but for a different reason. 
Assuming that protein sequences are random, their contacts will be also random, and 
a Boltzmann-like distribution of pair contacts arises as the ratio between the Gaussian 
distributions associated with random energies. In this respect, Finkelstein and coworkers 
interpret the temperature which characterizes this Boltzmann-like distribution as the 
freezing temperature of the random energy model in conformational space (see also ref. 
jHj). Ben-Naim has questioned whether the energy derived with statistical arguments 
corresponds to any physical process |3]. Thomas and Dill have emphasized the fact 
that, to derive a Boltzmann distribution, one assumes that each pair of amino acids is 
independent of all other pairs j3] , an assumption which is not valid because of the high 
density of amino acids in proteins. Moreover, they noted a possible problem in the fact 
that frequencies of occurence are obtained counting pairs of amino acids both in the 
same protein and in different proteins. They also show that the temperature and the 
reference state (the zero-energy state) are ill-defined (see also refs. JHIIZI)- 

It is our opinion that the analysis of amino acid pair distributions can be simplified 
if considered in an evolutionary context. In fact, evolution is the very dynamical process 
which controls which amino acids are in contact in the native conformations of proteins. 
The elementary moves of such dynamics are mutations, deletions and insertions, and 
take place on the timescale of millions of years. The misconception at the basis of the 
"quasi-chemical approximation" is the idea that the dynamical process which controls 
which amino acids are in contact is the conformational change which opens or closes a 

J Parallel approaches are those of Mirny and Shakhnovich , which optimize the Z-score (a quantity 
related to the energy gap between the native and the first-excited state of a protein sequence) 
simultaneously in a large set of different proteins. Vendruscolo et al. 9 and van Mourik et al jl()| 
perform, with different methods, similar optimization, under a zero-temperature approximation. Of 
course, several assumption other than the validity of Boltzmann distribution need to be verified, such 
as the two-body character of the potential [HI and its square-well shape ^2 ^1 |S] 
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contact, on the timescale of picoseconds. This dynamics is indeed present, and controls 
the stabihty of a contact (i.e. the probabihty that it is open rather than closed), but 
cannot be directly responsible for the likelihood of the amino acids which build the 
contact. 

In fact, the contact probabilities used to derive the contact energies are those 
calculated in the native state (i.e. crystallographic or NMR structures) which are, 
by virtue of Anfinsen's paradigm ^3, uniquely determined by the protein sequence. 
Consequently, the energy barrier which a protein must overcome to change a contact 
with another one is that associated with a change in the sequence (i.e., 3>kT, involving 
the whole replication machinery of the cell), the associated time scale being much longer 
than that of protein conformational changes (which take place at kT). 

From this view point, we can obtain information on the contact frequencies by 
analyzing the dynamics in the space of sequences at fixed conformation, rather than 
that in the space of conformations at fixed sequence. If one assumes that evolution has 
reached an equilibrium state §, then the contact frequencies observed in proteins can be 
identified with the equilibrium distribution of contacts associated with the underlying 
evolutionary dynamics. Whether we can assume that the dataset of known protein 
sequences represent thermodynamical equilibrium is a difficult issue. A suggestion that 
this is the case has been made by Rost in ref. fH]; noting that the distribution of 
sequence similarities for all known proteins is approximately Gaussian, centered around 
the random value 1/20. Note that, however, this suggestion has been questioned in ref. 

In order to be quantitative, one has to specify a model for evolutionary dynamics. 
The simplest model pictures an evolutionary step as a random mutation followed by a 
check on the ability the new sequence has to fold (i.e., if the mutated protein can still 
fold to a unique native conformation the move is accepted, otherwise the new sequence 
is discarded). Moreover, one can add an evolutionary bias toward low-energy proteins, 
assuming that more stable proteins are selected with higher probability (see below). In 
other words, the protein dataset is produced by subsequent mutations from a "parent" 
folding sequence. 

The selection rule, that is checking whether the mutated protein can fold, requires, 
in principle, a lengthy exploration of conformational space (e.g., by molecular dynamics). 
Fortunately, the selection rule can be further simplified. In fact, study of minimal models 
have shown that the folding ability of a protein sequence is mainly determined by its 
native state energy [201 |2I1 122] • Approximating the energy distribution of unfolded 
conformations by mean of the random energy model [2j (i.e., assuming that the contact 
energies in unfolded conformations are independent stochastic variables), it is possible to 
conclude that if a sequence displays in its conformational ground state an energy lower 
than a threshold Ec, determined by its length and by statistical properties (average 
and standard deviation) of the interaction potential (thus, independent of the detailed 

§ This implies that the evolutionary dynamics is ergodic and that it had enough time to converge to 
an equilibrium state. 
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Figure 1. The energies e predicted from Eq. ^ as a function of the "true" energies 
^true ygg^ |-Q design the sequences, for three different evolutionary temperatures T. 
Note that the energies are determined except for an additive constant. The correlation 
coefficients are 0.92, 0.86 and 0.88 for values of Tevoi of 80, 30 and 20, respectively. 
The associated slopes are 1.10, 1,10 and 1.02, respectively. 



sequence), then it is likely to fold to a unique native state (for a model study, see ref. 
[22] )• Moreover, the lower is the energy, the more stable is the native state. 

A consequence of these findings is that each point of sequence space can be labeled 
with a single quantity, namely the conformational ground state energy of the associated 
sequence, instead that with a whole conformational energy spectrum (that, in principle, 
would be necessary to assess the folding properties of a sequences). Evolution is then 
associated with energy fluctuations (i.e., exchange of energy between the protein and 
an ideal external reservoir), with the constraint that the native energy of a protein can 
never overcome the threshold E^. In accordance with classical thermodynamics, one can 
define a temperature T^^joi which control the equilibrium average native energy and the 
fluctuations about the average |24j. This temperature, which is not related in any way 
to the conformational temperature of the protein, has the meaning of an evolutionary 
pressure toward low-energy proteins. 

One has now to make three caveats concerning the model of evolution discussed 
above. First, it takes only into account the folding properties of a protein, but not 
other important features such as functional requirements, binding sites, etc. Second, 
only point mutations are allowed implying, among other things, that neither the length 
of the protein nor the threshold energy Ec change||. Third, it is important that the 
concentrations of different kinds of amino acids are constrained, otherwise at low values 

II A more realistic approach to protein evolution should allow for insertions and deletions, and thus for 
changes in both the length of the chain and of the energy Ec- Such a generalization of the problem can 
easily be accounted for, although in the following we will consider, for sake of clearness, only chains of 
fixed length. 
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of Tg^oi the system would display only those amino acids with the lowest interaction 
energy elements, and the chain would look more like a homopolymer (or oligopolymer) 
than like a protein. This would not only be unrealistic, since proteins typically display 
comparable concentration of the different types of amino acids (at least in terms of orders 
of magnitude), but also would challenge the folding ability of the protein, since the value 
of Ec depends on the average and standard deviation of the interaction matrix, weighted 
on the concentration of the amino acids. If such concentrations can vary, the value of E^ 
cannot be considered sequence-independent and the evolutionary algorithm described 
above fails. In the worst scenario (i.e., very low Te^oj with the Miyazawa-Jernigan 
interaction matrix), the evolutionary algorithm would produce mostly homopolymers 
composed of cysteine. To avoid this problem, we will fix (in Sect. ^ the concentration 
of different kinds of amino acids to their average values (cf. ref. pHp^. 

In what follows we work out the distribution probability of amino acid contacts. 
Using the standard derivation for closed systems (see, e.g., ref. |2S1), one can obtain the 
Boltzmann distribution p({cr}) = exp[—E{{a})/T]/Z for the probability of occurence 
of a sequence {a}, E{{a}) being the associated native energy (with E{{a} < Ec). 
Furthermore, one can consider each contact of the protein as a subsystem which, on the 
evolutionary timescale, can exchange energy (by mutations) with the rest of the protein, 
pictured as a heat bath. Making use of the basic approximation that each contact is not 
correlated to the others, the probability of observing a contact between amino acids of 
type p and r at equilibrium is 



p(p, ^) = ^ exp 



T 



(1) 



where e(p, r) is the interaction energy and T is a temperature-like parameter which 
controls the average energy of the contact. At equilibrium, the temperature of the 
system is equal throughout, and one can set T = T^^°K 

To be noted that the evolutionary Boltzmann statistics describing the distribution 
of contacts is different from the " quasi-chemical approximation" , because it takes place 
in the space of sequences and not in that of conformations. It is also different from 
that derived in ref. [2j, because it does not assume a protein database populated of 
random sequences. In particular, the temperature which controls the distribution is not 
a conformational temperature, but an evolutionary temperature. 

The assumptions made to obtain Eq. (^ are that the system is at equilibrium, 
that the contacts which build out a protein are uncorrelated and that the concentration 
of different types of amino acids can vary at will. While the equilibrium condition 
displays some soundness jTH], the others are quite crude. In the following, we will 
discuss corrections to Eq. (P) to take care of these problems, and test the validity of 
these corrections within the framework of a lattice model. 

^ Another solution is that of labehng each sequence with the Z-score instead that with the native 
energy, hke in ref. jHI- However this would make the following calculation more complicated. 
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Figure 2. The correlation lengtli at different temperatures for the SOmers model 
protein interacting through the Gaussian distributed matrix. 



2. The lattice model 

To test the validity of the approximation of Eq. (^ and to develop possible extensions, 
we follow the approach of refs. El QHI; where one choses an interaction matrix 
e*'^"'^(p, r) for model proteins, create a dataset of folding sequences according to this 
interaction matrix, derive from the sequence dataset the pair frequencies, and from 
these try to obtain the interaction energies by mean of relationships of the type given 
in Eq. (^ to be compared with the "true" energies. 

The model used for this purpose describes the protein as a chain of beads on the 
vertexes of a cubic lattice [SHI El 1221 • ^^ ^^^^ three kinds of interaction matrices 
^true^p^ r): two of them are extracted from ref. [T| (Table 6, which we shall call MJ, and 
Table 5, which we shall call MJ2), and the other one is a symmetric matrix generated 
extracting its elements at random from a Gaussian distribution. All of them have 
been shifted and rescaled to display zero mean and standard deviation 0.3 (in units of 
f^rproom _ g g kcal/mol). A particular feature of the matrix MJ2 is that it displays 
correlations between its elements, in such a way that can be well approximated by the 
two lowest eigenvalue terms in a eigenvalue decomposition [22] (naively speaking, most 
negative terms are concentrated toward one corner of the matrix and the most positive 
toward the opposite corner). The matrix MJ has been obtained by the matrix MJ2 
subtracting the effects of the solvent and does not display any feature which makes it 
distinguishable from a random matrix. The results obtained with the MJ matrix are 
substantially identical to those obtained with the random matrix. 

The dataset of sequences has been created fixing one (or more) compact structures 
and generating sequences with a Monte Carlo algorithm at temperature T'^^"K The pair 
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frequencies /(p, r) are then calculated counting the relative occurence of the contact 
between amino acids of kind p and r in the whole dataset. 

3. Spatial correlations 

Amino acids build complicated networks of interactions within globular proteins, 
networks which inevitably cause correlations between contacts. As a consequence, 
Eq. (0) provides a poor estimate of the interaction energies. In Fig. ^ it is shown 
the correlation between the "true" energies e*^"*^ used to design the sequences (in the 
case of the Gaussian matrix) and those obtained from Eq. (^, for different choices of 
the evolutionary temperature. An important parameter to quantify the validity of the 
energy prediction methods is the root mean square deviation (RMSD, i.e. the root of 
the reduced x^) associated with a linear fit of the predicted energies as a function of the 
"true" energies, which accounts for the typical variation of the predicted energies from 
the true ones. This parameter has to be compared with the energy scale of the system, 
which is T'^'""''. The values of RMSD associated with Fig. ^are listed in Table 1. 
In order to quantify spatial correlations we define the matrix 

G., = {e-ej)-{e,){e,), (2) 

where q is the total interaction energy of residue i with its neighbors (i.e., li = 
X]j e(o"i, o"j)A(|rj — rj\)), while the angular parenthesis indicate the thermodynamic 
average performed at temperature T'^'""'-, The elements of the correlation matrix decrease 
exponentially with respect to the distance Ar = |rj — r^l of the associated residues, 
allowing to define a correlation length A as G{Ar) ~ exp(Ar/A). The values of A as a 
function of the evolutionary temperature T are displayed in Fig. El 

It has been shown in ref. j22] that T'^'""'' ^ 30 is the evolutionary temperature 
below which good folder sequences are selected. One can notice that the correlation 
length A is not negligible below T'^^"'- ^ 30. This is not completely unexpected, since it 
is reasonable that good folders, unlike random sequences, display stabilizing long-range 
effects. 

Still lowering the evolutionary temperature below T ^ 10, the system enters a phase 
in which it is difficult to produce equilibrium distributions of sequences with the Monte 
Carlo algorithm, and which resembles the glassy phase of infinite disordered systems. 
Since the dynamics in glassy phases is very slow (see, e.g., ref. |2Z|), it is unlikely that 
evolution can have taken place below T ^ 10, and consequently we will disregard this 
phase in the following study. In the following we will develop a simple method to correct 
Eq. (Q) in order to account for such correlations. 

If instead of a single contact, one considers a subsystem composed of a contact and 
all its nearest neighbors (see inset of Fig. ^, the same argument used for the isolated 
contact suggests that the set of these subsystems follow a Boltzmann distribution, the 
probability of a contact between amino acids of kind p and r resulting (see also the 
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Appendix A) 

pZ-p r) = y^ g-/5[<:(P.T)+<:(P.7l)+<:{p,72)+e(p,73)+<:{T,74)+e{T,75)+e{T,76)] /g-j 

71 ■••76 

where the sum is over all the neighbors of the two residues of interest and f3 = 1/T, 
where T is the energy scale of the system (which must be provided as an input). Note 
that Eq. (jH} refers, as an example, to a case where each of the amino acids of the pair 
has 3 further neighbors (71, 72 and 73 are the neighbors of p, 74, 75 and 76 are the 
neighbors of r), but any number of neighbors is acceptable. Under the approximation 
that each residue is in contact with the same number F of other residues, one can rewrite 
Eq. (jH)) in the form 

(3e{p, t)= - \ogp{p, r) + (r - 1) log (t^ e'^^'^'^'A + 

+ {T -l)log(Te-^<-A -logZ. (4) 

This is a set of 20 x 20 implicit coupled equations which can be solved numerically to 
find the interaction energies e{p,T). The solution is determined except for the additive 
constant logZ, which cannot be determined by the set of p(p, r) alone (see also Sect. 
VI) . The physical meaning of Eq. (^ is to keep into account spatial correlations in the 
easiest way, assuming a uniform distribution of amino acids in the neighboring sites (see 
Appendix A) and without the need of calculating three- or more-body probabilities. 
Note that the approximation done in Eq. (j^ is not on the range of the correlations, 
but on the kind of coupling between residues. In fact, the energies which compare at 
the left side of Eq. (JH) are the "true" energies of the system, so that the account for 
correlations propagate to the whole size of the system. 

We first analyze sets of 10000 sequences interacting through the Gaussian matrix 
designed on a compact conformation built out of 80 residues (that used in ref. [SO])- The 
dependence of the results on the conformation (or conformations) used will be discussed 
in Sect. El 

The energies calculated solving numerically Eq. Q through 50000 iterations of a 
steepest descent optimization algorithm are displayed in Fig. El as a function of the 
true energies e^^^'^.The associated RMSD is listed in the third column of Table 1. These 
values have to be compared with the energy scale of the system, that is T'^^^K From 
these data it is clear that Eq. (^ performs much better than the use of pure Boltzmann 
statistics. 

A further refinement of the model consists in the keeping also into account of the 
interaction between nearest neighbors (i.e., between residues 71 and 74 and between 73 
and 76 in the inset of Fig. and consequently in Eq. Q ). The result is 

/?e(p, r) = - logp(p, r) + log (t^ e-^^'-'^'A + log (v^ e'^^^'A - 

+ log ( ^ ^-mP,-)+<-,-)M'^,r))\ _ jQg ^ ^5^ 
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Figure 3. The energies e predicted keeping into account the spatial correlations 
through Eq. Q as a function of the "true" energies e*''"'^ used to design the sequences, 
for three different evolutionary temperatures T. The correlation coefficients are 0.987, 
0.988 and 0.998, for values of T^yoi of 80, 30 and 20 respectively. The slopes of the 
linear fit are 0.99, 1.04 and 0.99, respectively. 



400 



300 



200 



100 



J'' 


(a) 


J" 




^^g^t$^^°:' 


°° (b) 


..^.us-AiSf*'./ 


* (c) 



-100 



-50 



50 



100 



Figure 4. In the case of sequences interacting through the MJ2 matrix selected at 
T = 30, it is displayed the energy e as a function of £*'""'=, calculated (a) through 
Boltzmann statistics (Eq. |^), (b) through the approximation of Eq. Q and (c) 
through the approximation of Eq. © . The dashed line indicates the steepness of the 
line e = e*''"'^. The correlation coefficients, calculated over the whole set, are 0.98, 0.93 
and 0.88, respectively. Restricting the correlation coefficient only to negative energies, 
it is 0.987, 0.992 and 0.984, respectively. 
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Table 1. The RMSD (standard deviation of the residuals) for the case of the Gaussian 
matrix, calculated in the Boltzmann approximation of Eq. ^ ("Boltzmann"), in the 
approximation of Eq. 01 (" neighbors" ) and in that of Eq. (" loops" ) . 



T 


Boltzmann 


neighbors 


loops 


15 


15.9 


7.0 


7.2 


20 


17.7 


5.0 


5.3 


30 


20.2 


5.1 


3.3 


50 


15.2 


1.9 


2.6 


80 


14.0 


1.7 


2.6 



This means considering that the amino acids neighboring to the contact of interest are 
not distributed uniformly, but, more realistically, display a distribution which depend 
on their own neighbors. The results, also listed in Table 1, show that Eq. (0) is roughly 
equivalent in the determination of the interaction energies than Eq. Q, although the 
numerical solution of Eq. © is definitely slower. 

Similar results are obtained for sequences interacting through the matrix MJ2. The 
correspondence between the energies e calculated in the three different ways discussed 
above (Eq. (P), Eq. (JD) and Eq. (0) ) and the true energies e*'""^ is displayed in Fig. 
m for the temperature T^'"°'' = 30. One peculiar feature of the results obtained from 
the MJ2 matrix is that the data associated with the Boltzmann approximation (cf. 
Fig. Ufa) ), although reasonably correlated, display an effective temperature which is 
approximately the half of the evolutionary temperature at which the sequences have 
been selected. This effective temperature could arise from correlations present in the 
interaction matrix (cf. Appendix B), as suggested by the fact that it does not appear 
in the case of the random matrix. 

Moreover, for all the three approximations there is a well-defined decrease in the 
quahty of the prediction at e*™^ > 0. This is due to a substantial decrease in the 
statistics associated with the data in this range of energies. The different behavior 
with respect to the random interaction matrix arises because here some amino acids 
are intrinsically more attractive than others, due to the order in the interaction matrix. 
Consequently, at low evolutionary temperatures, these attractive amino acids appear 
more often than the others. This does not happen in the case of the random matrix, 
where each amino acid interacts, in average, like any other. In other words, different 
amino acids display different chemical potentials in the case of the MJ2 matrix, while 
display the same chemical potential in the case of the random matrix. 

The RMSD related to the Boltzmann approximation, listed in Table 2, are 
calculated using the effective temperature, which corresponds to a rescaling of the actual 
temperature by a factor indicated in the fourth column of Table 2 (i.e., the slope of the 
linear fit of the function e(e*''"'^) ). While the features associated with the whole set of 
data are not very meaningful, being affected by the lack of statistics discussed above, 
the features associated with the range e*^"*^ < indicate a performance of the prediction 
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Table 2. The correlations between predicted and true energies for sequences 
interacting through the MJ2 matrix. The labels are the same as used in Table 1. 
The value in the column "slope" is the slope of the best fit to a linear function of 
the data generated by Eq. ^ The rows labeled by {E < 0) contains the result of fits 
considering only the negative values of £*'""'= . 





Boltzmann 


neighbors 


T 


RMSD 


slope 


RMSD 


15 


11.9 


1.7 


11.9 


15 {E < 0) 


8.8 




5.2 


20 


11.4 


2.3 


9.2 


20 {E < 0) 


8.6 




4.7 


30 


9.9 


2.3 


6.4 


30 {E < 0) 


8.3 




2.2 


50 


10.4 


2.7 


5.1 


50 {E < 0) 


8.4 




1.8 


80 


11.5 


2.9 


3.8 


80 {E < 0) 


8.7 




1.5 



methods comparable with that of the random matrix. Note that, for all purposes, the 
detailed knowledge of the attractive energies is much more critical that the knowledge 
of the repulsive. 

The values listed in Table 2 for e**""*^ < indicate that the Boltzmann approximation 
(Eq. (0) ) is better in the MJ2 case than in the random case. In any case, the 
approximation of Eq. Q gives better results in both cases. The behavior of the MJ 
matrix is identical to that of the random matrix (data not shown) . 

Note that the accounting for correlations described in Eq. Q can be easily 
integrated in any algorithm which derives potentials from pair statistics (e.g., ref. 
|13| I14j). Algorithms which optimize simultaneously the parameters which define the 
potentials, on the other hand, accounts implicitly for correlations between amino acids 
in the process of energy optimization. Nonetheless, the explicit treatment of such 
correlations seems to be fruitful. In the case of the algorithm employed in ref. [Hj, 
for example, the correlation coefficient between true and predicted energies is 0.84, 
while that associated with our method is, in the worst case, 0.88 (see caption to Fig. 



4. Correlations due to amino acid conservation 



Looking at the sequences generated at low evolutionary temperatures which produce 
the data analyzed in the previous Section, one notes an unrealistic feature, namely that 
at low Tejjoi the frequencies of the different kinds of amino acids are extremely uneven, 
some of the proteins displaying only a limited subset of all amino acids. Consequently, 
when working at high evolutionary pressure, could be desirable to constrain the relative 
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concentrations of amino acids. Of course, in performing this kind of approach, one gives 
up to obtain the concentration of amino acids as an emergent feature of the system 
and sets it by hand. The reason why we follow this strategy is that the concentrations 
of amino acids observed in real proteins are the result of a balance between different 
factors: not only the thermodynamical requirement of displaying a native energy well 
below Ec, but also the need not to display too many hydrophobic residues to prevent 
aggregation, to have enough amino acids of a given kind to perform some biological task, 
etc. The model we employ is too simple to account for all these effects, and accordingly 
we do not attempt to obtain the concentrations from the model, but we fix it a priori. 

Thus, we have repeated the above calculations using the MJ2 matrix and adding the 
constrain that the concentration of each amino acid is constant and equal to each other 
(this is a simplification, since different amino acids occur with different probabilities 
in real proteins ^^. The generalization is straightforward). This constrain introduces 
nonlocal correlations in the sequence, since the occurence probability of a pair of amino 
acids now depends on the presence of other amino acids of the same kind everywhere 
else. 

In order to keep into account these constrains, we employ a grand-canonical 
formalism, introducing chemical potentials /Iq for each kind of amino acid a. The 
Boltzmann (Eq. (^) and loop (Eq. Q) approximations become 

/3e(p, t) = - \ogp{p, r)+ pp + p^- log Z (6) 

(3e{p, t)= - \ogp{p, r) + (r - 1) log f V e-^^<P'^^-^A + 



(r - 1) log f ^ e-'3(^(-'^)-^-M +pp + pr- log Z, (7) 

respectively. The problem which arise is now the determination of the chemical 
potentials (we will search for chemical potentials which make each kind of amino acids 
equiprobable. The generalization is straightforward). For this purpose we will employ 
a mean field approximation, where 

1 20 

that is assuming that inserting an amino acid of given kind into the system costs a free 
energy given by the average interaction of that amino acid with all other kinds. 

In Fig. Efa) it is displayed the result obtained making use of the canonical 
Boltzmann approximation (Eq. (Q) ) for sequences selected at T^^°' = 30. The 
correlation coefficient is very poor (0.66), the slope of the linear fit is 0.23, which gives an 
effective temperature of T^^^i = 30 • 0.23 = 6.9. The RMSD is 7.9, that is 1.14 times the 
effective temperature. In Figs. Efb) and (c) are displayed the results obtained with the 
gran-canonical formalism of Eqs. © and (|7j), respectively, with the chemical potentials 
calculated from Eq. (jH)). While the grand-canonical Boltzmann approximation (jH)) 
gives a correlation coefficient of 0.89, the grand-canonical loop approximation of Eq. 
(jZj) gives a correlation coefficient of 0.97. The slopes of the linear fits are 1.02 and 
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Figure 5. The energies predicted from sequences interacting through the MJ2 matrix 
and in which the amounts of different kinds of amino acids are kept fixed. The 
prediction is performed through (a) Eq. JSJ, (b) © and (c) Q. The evolutionary 
temperature is T^voi — 30. The correlation coefhcients are 0.66, 0.89 and 0.97, 
respectively. 




Figure 6. The RMSD (solid curve, left axis) and the correlation coefficient (dashed 
curve, right axis) as a function of the number of sequences A^seij used to collect statistics, 
using the random matrix and T = 30. 



0.98, respectively. The associated RMSD are 7.9 and 9.0, respectively, which are 
approximately one fourth of the energy scale of the system. 

Although the approximation of Eq. (jHl) performs much better than the canonical 
Boltzmann approximation (Eq. (Q) ), it performs essentially as well as the approximation 
of Eq. ((Zj) . This could be due to the fact that the correlations arising from amino acids 
conservation dump the correlations associated with local proximity of amino acids. 
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5. Database dependence of the results 

In order to study what is the amount of sequences needed to obtain a good estimate of 
the interaction energies, we have calculated the RMSD and the correlation coefficient as 
a function of the number of sequences Ngeq used. The energies are calculated by mean 
of Eq. (j3)) at T = 30. The plot, shown in Fig. IHl displays a marked worsening of the 
predicting capability of the algorithm around Ngeq ~ 100. Since each sequence displays 
105 contacts between its amino acid, it means that one needs ~ 10^ contacts to obtain 
reliable results. Note that at Ng^q ~ 100 there are 9 pairs of amino acids out of 210 
which never appear in the statistics, a number which grows to 101 for Ngeq ~ 10 and to 
162 for Nseq ~ 1- All these pairs display large, repulsive "true" energies. 

When studying real protein sequences one usually does not collect pair statistics 
from many sequences"*" folding to the same conformation, but rather from sequences 
folding to different conformations. We have studied a set of 100 sequences, 
optimized on 10 different 80mer compact conformation (10 sequences optimized on 
each conformation). The RMSD results 13.1 and the correlation coefficient 0.90, to 
be compared with 13.3 found using a single conformation. This suggests that there 
is no difference in using proteins with different native conformations to collect pair 
statistics. 

Repeating the same calculation using model proteins of different length (27, 36, 
48 and 80) provides a RMSD of 16.2 and a correlation coefficient of 0.86. This slight 
under-performance appears to be connected with the fact that the value of T in Eq. (JH) 
depends on the length of the protein, through the ratio of bulk/surface residues, and 
consequently cannot be considered fixed. This problem becomes more serious in the 
case of real proteins, where the continuous character of the degrees of freedom makes 
the coordination number of residues more variable than in the case of lattice models. 
However, one could expect that this is partially compensated by the fact that real 
proteins are typically longer than the lattice chains we have employed, and consequently 
the ratio of bulk/surface residues is larger (at least, for globular proteins). In the case of 
the set of proteins used by Miyazawa and Jernigan jlj, the mean coordination number is 
10 and the standard deviation is 4. To reduce the standard deviation one can, following 
ref. [IJ, consider only interior residues, that is residues within 7 A from the centre of the 
protein, obtaining a mean coordination number of 6 with a standard deviation of 1.7 
(to be compared with the mean and the average associated with lattice model proteins, 
that is 4 and 1.7, respectively). 

To model evolution, we have assumed a constant evolutionary temperature, which 
also sets the energy scale of the resulting interaction matrix. While that lack of 
knowledge on the detailed value of this temperature causes no problem in the efficiency 
of Eq. (j3]), since the energies are simply rescaled according to T*^^"', problems arise if 
the set of sequences is selected making use of different temperatures (which is quite a 

+ Of course these should be statisticahy independent sequences, that is, in bioinforniatical language, 
non-homologous sequences. 
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realistic situation). Using a dataset composed of 333 sequences selected at T^^°' = 20, 
333 sequences selected at T^"""' = 30 and 333 sequences selected at T"^™' = 50, we obtain 
an effective temperature of T*^^°' = 24.7, a RMSD of 7.3 (instead of 3.3, see Table 1) 
and a correlation coefficient of 0.95 (instead of 0.99). 

6. Conclusions 

We have shown that, looking at the set of proteins from an evolutionary point of view, 
it is possible to obtain contact energies between amino acids in a sound way, making 
use of standard thermodynamics (although used in the context of finite systems). This 
approach also allows to understand which are the limitations in the approximations of 
the method, as, for example, the presence of correlations between amino acids, and to 
develop corrections to those approximations. 

For sequences generated without any constrain on their composition, accounting 
for correlations allows to decrease the RMSD from 20.2 to 5.1 in the case of the MJ 
matrix and from 8.3 to 2.2 in the case of the negative elements of the MJ2 matrix (the 
correlation coefficients increasing from 0.86 to 0.99 and from 0.98 to 0.99, respectively). 
Costraining the composition of proteins, the RMSD (relative to the energy scale T^^oi) 
decreases from 1.1 to 0.2 in both the case of the grand-canonical independent-contact 
approximation and of the grand-canonical formalism with correlation corrections (the 
correlation coefficients being 0.66, 0.89 and 0.97, respectively). 

Of course, there are still strong approximations in this approach, as the step-shape 
of the potential and its isotropic character. On the other hand, these are computational 
limitations which do not undermine the soundness of this approach, being easy to extend 
it to situations displaying more complicated shapes of the potential function (see, e.g., 
ref. ^5)- Another problem which has not been dealt with is the determination of the 
zero of the energies (i.e., the quantity — TlogZ). This is a key issue when one wants to 
use these potentials to determine, for example, binding constants, but it is less critical 
in folding studies. Anyway, the determination of this parameter is something which 
cannot be done from the contact probabilities alone, but requires further ingredients, 
and consequently goes beyond the purpose of the present work. 

It is interesting to repeat the calculations performed by Miyazawa and Jernigan, 
deriving the frequencies of occurence of amino acid pairs in the same database of 
real proteins used by them jT]. Comparing the energies obtained making use of the 
Boltzmann-statistics approach (i.e., Eq. (^, in the spirit of the derivation of ref. jl]* 
with the corrections discussed above, we obtain energies which are substantially different 
from theirs. Taking care of the spatial correlations (Eq. (jl} ), we obtain a set of energies 
whose correlation parameter with the Miyazawa- Jernigan set is 0.91 and whose RMSD 
is 0.28 T. In keeping into account also the correlations due to amino acids conservation 
(Eq. (jni)) the correlation coefficient becomes 0.96 and the RMSD 0.17T. 

* Disregarding in this comparison the corrections done in ref. ^ to account for the effects of the 
solvent. 
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Appendix A. 

If one assumes that a protein is a closed system, the equilibrium probability for a 
sequence cxi, (T2, ..., o"7v is given by 

p({a,}) = iexp[-/3^6(a„a,)A(h-r,.|)], (A.l) 

where A(|rj — rj\) is a contact function which contains the information about the 
coordinates {rj} of the (fixed) native conformation. Since we are interested in the 
probability of a single contact (say, between ctj and aj), we can make the interaction 
e(cri, cTj) explicit, as 

p(a„a,) = ^e-^^('^-'^^) E exp[-/3(e((T„ a^) + e(afc, a,) + ...)]• (A.2) 

Focusing our attention on the nearest neighbors (e.g., k) of the ith and jth contacts, we 
obtain 

1 



E exp[-/?(e(crfc,a-i) +e(cr«,crm) + •••] + ••• (A. 3) 






which can be shortened as 

P(^i,^.-) = ^e-P<'^-''^^Y.^-P'^''''^''\{a,) + ... (A.4) 

where g(o-fc) = Ea,,<7,„,... exp[-/3(e(crfc, ci;) + e(ori,(Jm) + ...)] is proportional to the 
probability that the fcth site of the protein is occupied by amino acid ak due to the 
effect of all other residues of the protein except that in the ith site. 

The approximation of Eq. (0) consists in assuming that q{a) = 1, meaning that 
each kind of amino acid has the same probability of occupying a site in contact with 
the pair i — j under study. 

Appendix B. 

The existence of an effective temperature arises from the the correlations present in the 
MJ2 correlation matrix. To illustrate this point, consider the example of an interaction 
matrix which emphasizes the hydrophobic character of the amino acids, in the form 
e(p, r) = x{p) + x{t), then Eq. Q becomes 

■^ [ J-evol J \ 7 / 

which approximates a Boltzmann distribution with an effective temperature T^^^i = 
Tevoi/^ smaller than the real temperature Te^oi- 

This is the simplest kind of correlation which could affect the system. It has 
been suggested in ref. ^26j that the interaction matrix MJ2 has the form e(p, r) = 
x{p) + x{t) + Xx{p) ■ x{t), corresponding to considering only the first two terms in the 
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Eigenvalue decomposition of the matrix. This, or more comphcated, kinds of correlation 
cannot be accounted for analytically, but are likely to be responsible for the appearance 

of an effective temperature. 
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